189 research outputs found

    EC-PSI: Associating Enzyme Commission Numbers with Pfam Domains

    With the growing number of protein structures in the Protein Data Bank (PDB), there is a need to annotate these structures at the domain level in order to relate protein structure to protein function. Thanks to the SIFTS database, many PDB chains are now cross-referenced with Pfam domains and Enzyme Commission (EC) numbers. However, these annotations do not include any explicit relationship between individual Pfam domains and EC numbers. This article presents a novel statistical training-based method called EC-PSI that can automatically infer high-confidence associations between EC numbers and Pfam domains directly from EC-chain associations in SIFTS and from EC-sequence associations in the SwissProt and TrEMBL databases. By collecting and integrating these existing EC-chain and EC-sequence annotations, our approach is able to infer a total of 8,329 direct EC-Pfam associations with an overall F-measure of 0.819 with respect to the manually curated InterPro database, which we treat here as a "gold standard" reference dataset. Thus, compared to the 1,493 EC-Pfam associations in InterPro, our approach provides a way to find more than five times as many high-quality EC-Pfam associations completely automatically.
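    The F-measure quoted above scores a set of predicted associations against a reference set. The following is a minimal sketch, assuming the standard balanced F-measure (harmonic mean of precision and recall) over sets of (EC number, Pfam domain) pairs. The abstract does not say how the evaluation was restricted or weighted, and the identifiers below are invented placeholders, not real annotations.

```python
def f_measure(predicted, reference):
    """Return (precision, recall, F1) of a predicted set against a reference set."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical (EC number, Pfam domain) pairs, made up for illustration only.
reference = {("EC_a", "PF_x"), ("EC_b", "PF_y"), ("EC_c", "PF_z")}
predicted = {("EC_a", "PF_x"), ("EC_b", "PF_y"), ("EC_d", "PF_w")}

precision, recall, f1 = f_measure(predicted, reference)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

    Treating associations as hashable pairs keeps the comparison a matter of plain set intersection, which is why no external library is needed here.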

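    Collecte de données biologiques à partir de sources multiples et hétérogènes. Vers une structure de médiation conviviale et orientée source [Collecting biological data from multiple heterogeneous sources: towards a user-friendly, source-oriented mediation structure]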

    Electronic proceedings: http://www.lalic.paris4.sorbonne.fr/stic/octobre/programme0209.html. Peer-reviewed national conference with proceedings. After reviewing the main characteristics of biological data sources, we briefly describe two studies that we carried out in connection with two biological problems, both concerning data collection according to pre-established scenarios for querying a succession of sources. We then outline the follow-up research project, which is intended as a generalisation of this earlier work: faced with a fairly complex data collection problem, it aims to help the biologist identify relevant sources, to guide the definition of the collection scenario, and to run this scenario on real data. We explain how this project is anchored in the semantic web research area and describe the first application domain envisaged.

    Computational Discovery of Direct Associations between GO terms and Protein Domains

    Background: Families of related proteins and their different functions may be described systematically using common classifications and ontologies such as Pfam and GO (Gene Ontology). However, many proteins consist of multiple domains, and each domain, or some combination of domains, can be responsible for a particular molecular function. Therefore, identifying which domains should be associated with a specific function is a non-trivial task. Results: We describe a general approach for the computational discovery of associations between different sets of annotations by formalising the problem as a bipartite graph enrichment problem in the setting of a tripartite graph. We call this approach "CODAC" (for COmputational Discovery of Direct Associations using Common Neighbours). As one application of this approach, we describe "GODomainMiner" for associating GO terms with protein domains. We used GODomainMiner to predict GO-domain associations between each of the 3 GO ontology namespaces (MF, BP, and CC) and the Pfam, CATH, and SCOP domain classifications. Overall, GODomainMiner yields average enrichments of 15-, 41- and 25-fold GO-domain associations compared to the existing GO annotations in these 3 domain classifications, respectively. Conclusions: These associations could potentially be used to annotate many of the protein chains in the Protein Data Bank and protein sequences in UniProt whose domain composition is known but which currently lack GO annotation.
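    The common-neighbour idea can be illustrated on a toy tripartite graph in which proteins sit between domains and GO terms. This is only a sketch under stated assumptions: it counts, for each (domain, term) pair, the proteins linked to both, and it does not reproduce the scoring, filtering, or enrichment steps of CODAC, which the abstract does not detail. All names and identifiers are hypothetical.

```python
from collections import defaultdict
from itertools import product


def common_neighbour_counts(protein_domains, protein_terms):
    """Count, for each (domain, term) pair, the proteins annotated with both.

    protein_domains: dict mapping protein -> iterable of domain identifiers
    protein_terms:   dict mapping protein -> iterable of GO term identifiers
    """
    counts = defaultdict(int)
    for protein, domains in protein_domains.items():
        terms = protein_terms.get(protein, ())
        # Each protein is a common neighbour of every domain/term pair it links.
        for domain, term in product(set(domains), set(terms)):
            counts[(domain, term)] += 1
    return dict(counts)


# Hypothetical toy annotations (not real Pfam or GO identifiers).
protein_domains = {"P1": ["D1", "D2"], "P2": ["D1"], "P3": ["D2"]}
protein_terms = {"P1": ["T1"], "P2": ["T1"], "P3": ["T2"]}

counts = common_neighbour_counts(protein_domains, protein_terms)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

    In this toy graph the single-domain protein P2 is what separates D1 from D2 as the more likely partner of T1, which hints at why multi-domain proteins alone make the association task non-trivial.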

    IntelliGO: a new vector-based semantic similarity measure including annotation origin

    BACKGROUND: The Gene Ontology (GO) is a well-known controlled vocabulary describing the biological process, molecular function and cellular component aspects of gene annotation. It has become a widely used knowledge source in bioinformatics for annotating genes and measuring their semantic similarity. These measures generally involve the GO graph structure, the information content of GO aspects, or a combination of both. However, only a few of the semantic similarity measures described so far can handle GO annotations differently according to their origin (i.e. their evidence codes). RESULTS: We present here a new semantic similarity measure called IntelliGO which integrates several complementary properties in a novel vector space model. The coefficients associated with each GO term that annotates a given gene or protein include its information content as well as a customized value for each type of GO evidence code. The generalized cosine similarity measure, used for calculating the dot product between two vectors, has been rigorously adapted to the context of the GO graph. The IntelliGO similarity measure is tested on two benchmark datasets consisting of KEGG pathways and Pfam domains grouped as clans, considering the GO biological process and molecular function terms, respectively, for a total of 683 yeast and human genes and involving more than 67,900 pair-wise comparisons. The ability of the IntelliGO similarity measure to express the biological cohesion of sets of genes compares favourably to four existing similarity measures. For inter-set comparison, it consistently discriminates between distinct sets of genes. Furthermore, the IntelliGO similarity measure allows the influence of weights assigned to evidence codes to be checked. Finally, the results obtained with a complementary reference technique give intermediate but correct correlation values with the sequence similarity, Pfam, and Enzyme classifications when compared to previously published measures. CONCLUSIONS: The IntelliGO similarity measure provides a customizable and comprehensive method for quantifying gene similarity based on GO annotations. It also displays a robust set-discriminating power which suggests it will be useful for functional clustering. AVAILABILITY: An on-line version of the IntelliGO similarity measure is available at: http://bioinfo.loria.fr/Members/benabdsi/intelligo_project
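    A generalized cosine similarity differs from the ordinary one in that the basis vectors (here, GO terms) are not assumed to be orthogonal, so the dot product sums over all pairs of terms. The sketch below shows that general form only. The term-to-term similarity table and the coefficients are invented placeholders; how IntelliGO derives term similarities from the GO graph and combines information content with evidence-code weights is not reproduced here.

```python
import math


def generalized_dot(u, v, basis_sim):
    """Dot product of sparse vectors u and v when basis vectors may overlap.

    u, v:      dict mapping term -> coefficient
    basis_sim: function (term, term) -> similarity of the two basis vectors
    """
    return sum(a * b * basis_sim(s, t) for s, a in u.items() for t, b in v.items())


def generalized_cosine(u, v, basis_sim):
    """Cosine similarity built on the generalized dot product."""
    norm = math.sqrt(generalized_dot(u, u, basis_sim) * generalized_dot(v, v, basis_sim))
    return generalized_dot(u, v, basis_sim) / norm if norm else 0.0


# Hypothetical similarity between two distinct terms (assumption, not GO data).
TERM_SIM = {frozenset(("T1", "T2")): 0.5}


def basis_sim(s, t):
    """A term is fully similar to itself; unlisted pairs are treated as orthogonal."""
    if s == t:
        return 1.0
    return TERM_SIM.get(frozenset((s, t)), 0.0)


# Hypothetical genes, each annotated with one weighted term.
gene_a = {"T1": 1.0}
gene_b = {"T2": 2.0}

similarity = generalized_cosine(gene_a, gene_b, basis_sim)
print(similarity)
```

    With an orthogonal basis the two genes above would have similarity zero because they share no term; the off-diagonal entry is what lets related but distinct annotations contribute.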

    NRPS toolbox for the discovery of new nonribosomal peptides and synthetases

    Nonribosomal peptide synthetases are huge multi-enzymatic complexes that synthesize peptides without going through the classical process of transcription followed by translation. The synthetases are organised in modules, each of which incorporates one amino acid into the final peptide. The modules are divided into domains that provide specialized activities. As a result, these enzymes are as diverse as their products. We present our toolbox designed to annotate them accurately, together with promising results obtained on some Burkholderia, Bacillus and Pseudomonas genomes.

    HexServer: an FFT-based protein docking server powered by graphics processors

    HexServer (http://hexserver.loria.fr/) is the first fast Fourier transform (FFT)-based protein docking server to be powered by graphics processors. Using two graphics processors simultaneously, a typical 6D docking run takes ∼15 s, which is up to two orders of magnitude faster than conventional FFT-based docking approaches using comparable resolution and scoring functions. The server requires two protein structures in PDB format to be uploaded, and it produces a ranked list of up to 1000 docking predictions. Knowledge of one or both protein binding sites may be used to focus and shorten the calculation when such information is available. The first 20 predictions may be accessed individually, and a single file of all predicted orientations may be downloaded as a compressed multi-model PDB file. The server is publicly available and does not require any registration or identification by the user.

    Kbdock - Searching and organising the structural space of protein-protein interactions

    Big data is a recurring problem in structural bioinformatics, where even a single experimentally determined protein structure can contain several different interacting protein domains and often involves many tens of thousands of 3D atomic coordinates. If we consider all protein structures that have ever been solved, the immense structural space of protein-protein interactions needs to be organised systematically in order to make sense of the many functional and evolutionary relationships that exist between different protein families and their interactions. This article describes some new developments in Kbdock, a knowledge-based approach for classifying and annotating protein interactions at the protein domain level.

    Extraction de données pharmacogénomiques à partir d'études cliniques : problématique [Extracting pharmacogenomic data from clinical studies: problem statement]

    Individual variation in drug response is becoming a significant problem for both pharmaceutical research and medical practice. Our research project aims to integrate clinical and genetic data from clinical studies in order to extract knowledge about the relationships between a particular genotype and its influence on the effect of a drug. To address this problem, we are looking for data mining methods that are suited to the biomedical data we intend to handle and that can incorporate domain knowledge in the form of an ontology. This project is the subject of a PhD thesis that began in November 2004.

    A Hybrid Approach to Identifying the Most Predictive and Discriminant Features in Supervised Classification Problems

    In this paper, we are interested in the predictive and discriminant nature of features in supervised classification problems. We discuss the notions of prediction and discrimination and propose a hybrid approach combining supervised classifiers, model explanation, multicriteria decision making and pattern mining for identifying the most predictive and discriminant features in a dataset. The explanation of models learned by supervised classifiers produces rankings of features according to various performance measures. Based on these rankings, multicriteria decision making and pattern mining methods are used to, respectively, select the most important features and interpret their role in terms of prediction and discrimination. Finally, we present and discuss two experiments on public datasets illustrating the potential of the approach.

    Explaining Multicriteria Decision Making with Formal Concept Analysis

    Multicriteria decision making aims at helping a decision maker choose the best solutions among alternatives compared against multiple conflicting criteria. The reasons why an alternative is considered among the best are not always clearly explained. In this paper, we propose an approach that uses formal concept analysis and background knowledge on the criteria to explain the presence of alternatives on the Pareto front of a multicriteria decision problem.
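    For readers unfamiliar with the term, the Pareto front is the set of alternatives that no other alternative dominates. The sketch below computes it for a toy problem, assuming every criterion is to be maximised. The alternatives and scores are invented, and the explanation step based on formal concept analysis described in the abstract is not reproduced.

```python
def dominates(a, b):
    """True if score tuple a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(alternatives):
    """Return the names of alternatives that no other alternative dominates.

    alternatives: dict mapping name -> tuple of scores (higher is better).
    """
    return {
        name
        for name, score in alternatives.items()
        if not any(dominates(other, score) for other in alternatives.values())
    }


# Hypothetical alternatives scored on two conflicting criteria.
alternatives = {"A": (3, 1), "B": (1, 3), "C": (1, 1), "D": (2, 2)}

front = pareto_front(alternatives)
print(sorted(front))
```

    Alternative C is excluded because D is better on both criteria, while A, B and D each win on at least one criterion against every rival, so none of them can be ruled out without further preferences.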